Flow-based Generative Model

A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flow, a statistical method that uses the change-of-variables law of probability to transform a simple distribution into a complex one.

The direct modeling of likelihood provides many advantages. For example, the negative log-likelihood can be directly computed and minimized as the loss function. Additionally, novel samples can be generated by sampling from the initial distribution and applying the flow transformation. In contrast, many alternative generative modeling methods, such as the variational autoencoder (VAE) and the generative adversarial network (GAN), do not explicitly represent the likelihood function.


Method

Let z_0 be a (possibly multivariate) random variable with distribution p_0(z_0).

For i = 1, ..., K, let z_i = f_i(z_{i-1}) be a sequence of random variables transformed from z_0. The functions f_1, ..., f_K should be invertible, i.e. the inverse function f_i^{-1} exists. The final output z_K models the target distribution.

The log-likelihood of z_K is (see the derivation below):

: \log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{df_i(z_{i-1})}{dz_{i-1}} \right|

To efficiently compute the log-likelihood, the functions f_1, ..., f_K should be easily invertible, and the determinants of their Jacobians should be simple to compute. In practice, the functions f_1, ..., f_K are modeled using deep neural networks, and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE, RealNVP, and Glow.
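The composition and the log-likelihood formula above can be made concrete with a toy example. The following sketch (NumPy/SciPy; the two 1-D maps and all names are illustrative choices, not taken from any particular library or paper) composes two invertible maps, accumulates the log-Jacobian terms, and recovers z_0 with the inverses:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

# each flow layer: (forward, inverse, log|df/dz| evaluated at the layer input)
flows = [
    (lambda z: 2.0 * z + 1.0,
     lambda x: (x - 1.0) / 2.0,
     lambda z: np.log(2.0)),
    (lambda z: np.log1p(np.exp(z)),          # softplus, invertible on R
     lambda x: np.log(np.expm1(x)),
     lambda z: z - np.log1p(np.exp(z))),     # log|d softplus/dz| = log sigmoid(z)
]

z0 = 0.3
z, log_det_sum = z0, 0.0
for fwd, _, log_det in flows:
    log_det_sum += log_det(z)                # evaluated at z_{i-1}
    z = fwd(z)
zK = z

# log p_K(z_K) = log p_0(z_0) - sum_i log|det df_i/dz_{i-1}|, with p_0 = N(0,1)
log_pK = norm.logpdf(z0) - log_det_sum

# invertibility: applying the inverses in reverse order recovers z_0
z_back = zK
for _, inv, _ in reversed(flows):
    z_back = inv(z_back)
assert np.isclose(z_back, z0)
</syntaxhighlight>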


Derivation of log likelihood

Consider z_1 and z_0. Note that z_0 = f_1^{-1}(z_1). By the change-of-variable formula, the distribution of z_1 is:

: p_1(z_1) = p_0(z_0) \left| \det \frac{df_1^{-1}(z_1)}{dz_1} \right|

where \det \frac{df_1^{-1}(z_1)}{dz_1} is the determinant of the Jacobian matrix of f_1^{-1}. By the inverse function theorem:

: p_1(z_1) = p_0(z_0) \left| \det \left( \frac{df_1(z_0)}{dz_0} \right)^{-1} \right|

By the identity \det(A^{-1}) = \det(A)^{-1} (where A is an invertible matrix), we have:

: p_1(z_1) = p_0(z_0) \left| \det \frac{df_1(z_0)}{dz_0} \right|^{-1}

The log-likelihood is thus:

: \log p_1(z_1) = \log p_0(z_0) - \log \left| \det \frac{df_1(z_0)}{dz_0} \right|

In general, the above applies to any z_i and z_{i-1}. Since \log p_i(z_i) is equal to \log p_{i-1}(z_{i-1}) minus a non-recursive term, we can infer by induction that:

: \log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{df_i(z_{i-1})}{dz_{i-1}} \right|
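The change-of-variable identity above can be checked numerically in one dimension, where the pushforward density is known in closed form. A minimal sketch (NumPy/SciPy), assuming a standard normal base and the map f(z) = e^z, whose pushforward is the standard log-normal:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm, lognorm

z0 = 0.7
z1 = np.exp(z0)                                      # z_1 = f(z_0)
lhs = lognorm.logpdf(z1, s=1.0)                      # log p_1(z_1), closed form
rhs = norm.logpdf(z0) - np.log(np.abs(np.exp(z0)))   # log p_0(z_0) - log|f'(z_0)|
assert np.isclose(lhs, rhs)
</syntaxhighlight>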


Training method

As is generally done when training a deep learning model, the goal with normalizing flows is to minimize the Kullback–Leibler divergence between the model's likelihood and the target distribution to be estimated. Denoting p_\theta the model's likelihood and p^* the target distribution to learn, the (forward) KL-divergence is:

: D_{\text{KL}}[p^*(x) \,\|\, p_\theta(x)] = -\mathbb{E}_{p^*(x)}[\log p_\theta(x)] + \mathbb{E}_{p^*(x)}[\log p^*(x)]

The second term on the right-hand side of the equation corresponds to the entropy of the target distribution and is independent of the parameter \theta we want the model to learn, which leaves only the expectation of the negative log-likelihood under the target distribution to minimize. This intractable term can be approximated with a Monte Carlo method by importance sampling. Indeed, if we have a dataset \{x_i\}_{i=1}^N of samples each independently drawn from the target distribution p^*(x), then this term can be estimated as:

: -\hat{\mathbb{E}}_{p^*(x)}[\log p_\theta(x)] = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)

Therefore, the learning objective

: \underset{\theta}{\arg\min}\ D_{\text{KL}}[p^*(x) \,\|\, p_\theta(x)]

is replaced by

: \underset{\theta}{\arg\max}\ \sum_{i=1}^{N} \log p_\theta(x_i)

In other words, minimizing the Kullback–Leibler divergence between the model's likelihood and the target distribution is equivalent to maximizing the model likelihood under observed samples of the target distribution. A pseudocode for training normalizing flows is as follows:

* INPUT. dataset x_{1:N}, normalizing flow model f_\theta(\cdot), p_0.
* SOLVE. \max_\theta \sum_j \ln p_\theta(x_j) by gradient descent
* RETURN. \hat\theta
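The pseudocode above can be fleshed out into a working training loop. A minimal sketch in PyTorch, fitting a single affine flow x = a z + b with a standard normal base by maximum likelihood; the model, dataset and hyperparameters are illustrative only:

<syntaxhighlight lang="python">
import torch

torch.manual_seed(0)
data = 3.0 + 0.5 * torch.randn(1024)          # samples from the "target"

log_a = torch.zeros((), requires_grad=True)   # parametrize a = exp(log_a) > 0
b = torch.zeros((), requires_grad=True)
opt = torch.optim.Adam([log_a, b], lr=1e-2)
base = torch.distributions.Normal(0.0, 1.0)

for step in range(2000):
    a = log_a.exp()
    z = (data - b) / a                         # inverse flow: z = f^{-1}(x)
    # log p_theta(x) = log p_0(f^{-1}(x)) + log|d f^{-1}/dx| = log p_0(z) - log a
    log_px = base.log_prob(z) - log_a
    loss = -log_px.mean()                      # negative log-likelihood
    opt.zero_grad(); loss.backward(); opt.step()

print(log_a.exp().item(), b.item())            # should approach roughly 0.5 and 3.0
</syntaxhighlight>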


Variants


Planar Flow

The earliest example. Fix some activation function h, and let \theta = (u, w, b) with the appropriate dimensions; then

: x = f_\theta(z) = z + u\, h(\langle w, z \rangle + b)

The inverse f_\theta^{-1} has no closed-form solution in general. The Jacobian determinant is

: \left|\det\left(I + h'(\langle w, z \rangle + b)\, u w^T\right)\right| = \left|1 + h'(\langle w, z \rangle + b) \langle u, w \rangle\right|

For the map to be invertible everywhere, this must be nonzero everywhere. For example, h = \tanh with \langle u, w \rangle > -1 satisfies the requirement.
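A minimal sketch (NumPy) of one planar-flow layer with h = tanh, returning the transformed point and the log-Jacobian term from the formula above; the parameter values are illustrative and are simply assumed to satisfy the invertibility condition:

<syntaxhighlight lang="python">
import numpy as np

def planar_forward(z, u, w, b):
    """x = z + u * tanh(<w, z> + b), together with log|det Jacobian|."""
    a = np.tanh(w @ z + b)
    x = z + u * a
    # |det(I + h'(.) u w^T)| = |1 + h'(.) <u, w>|, with h'(t) = 1 - tanh(t)^2
    log_det = np.log(np.abs(1.0 + (1.0 - a**2) * (u @ w)))
    return x, log_det

rng = np.random.default_rng(0)
d = 4
u, w, b = rng.normal(size=d), rng.normal(size=d), 0.1
# we assume <u, w> > -1 here; in practice u is re-projected to enforce it
x, log_det = planar_forward(rng.normal(size=d), u, w, b)
print(x, log_det)
</syntaxhighlight>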


Nonlinear Independent Components Estimation (NICE)

Let x, z \in \R^{2n} be even-dimensional, and split them in the middle. Then the normalizing flow functions are

: x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}

where m_\theta is any neural network with weights \theta. The inverse f_\theta^{-1} is simply z_1 = x_1, z_2 = x_2 - m_\theta(x_1), and the Jacobian determinant is 1; that is, the flow is volume-preserving. When n = 1, this can be seen as a curvy shearing along the x_2 direction.
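A minimal sketch (NumPy) of one NICE additive coupling layer, with a toy function standing in for the neural network m_\theta; it illustrates the exact inverse and the volume-preserving (zero log-determinant) property:

<syntaxhighlight lang="python">
import numpy as np

def m_theta(z1):                       # stand-in for a trained neural network
    return np.tanh(z1) + 0.5 * z1

def nice_forward(z):
    z1, z2 = np.split(z, 2)
    return np.concatenate([z1, z2 + m_theta(z1)])   # log|det J| = 0

def nice_inverse(x):
    x1, x2 = np.split(x, 2)
    return np.concatenate([x1, x2 - m_theta(x1)])

z = np.random.default_rng(0).normal(size=6)
assert np.allclose(nice_inverse(nice_forward(z)), z)  # exact inverse
</syntaxhighlight>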


Real Non-Volume Preserving (Real NVP)

The Real Non-Volume Preserving model generalizes the NICE model by:

: x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ e^{s_\theta(z_1)} \odot z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}

Its inverse is z_1 = x_1, z_2 = e^{-s_\theta(x_1)} \odot (x_2 - m_\theta(x_1)), and its Jacobian determinant is \prod_{i=1}^{n} e^{s_\theta(z_1)_i}. The NICE model is recovered by setting s_\theta = 0. Since the Real NVP map keeps the first and second halves of the vector x separate, a permutation (x_1, x_2) \mapsto (x_2, x_1) is usually added after every Real NVP layer.
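A minimal sketch (NumPy) of one Real NVP affine coupling layer, again with toy stand-ins for the networks s_\theta and m_\theta:

<syntaxhighlight lang="python">
import numpy as np

def s_theta(z1): return 0.3 * np.tanh(z1)      # log-scale network stand-in
def m_theta(z1): return np.sin(z1)             # shift network stand-in

def realnvp_forward(z):
    z1, z2 = np.split(z, 2)
    x2 = np.exp(s_theta(z1)) * z2 + m_theta(z1)
    log_det = np.sum(s_theta(z1))              # log prod_i e^{s_i} = sum_i s_i
    return np.concatenate([z1, x2]), log_det

def realnvp_inverse(x):
    x1, x2 = np.split(x, 2)
    z2 = np.exp(-s_theta(x1)) * (x2 - m_theta(x1))
    return np.concatenate([x1, z2])

z = np.random.default_rng(1).normal(size=6)
x, log_det = realnvp_forward(z)
assert np.allclose(realnvp_inverse(x), z)
</syntaxhighlight>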


Generative Flow (Glow)

In the generative flow model, each layer has three parts:

* channel-wise affine transform

: y_{c,i,j} = s_c (x_{c,i,j} + b_c)

with Jacobian determinant \prod_c s_c^{HW}, where H and W are the spatial dimensions of the feature map.

* invertible 1x1 convolution

: z_{c,i,j} = \sum_{c'} K_{c,c'}\, y_{c',i,j}

with Jacobian determinant \det(K)^{HW}. Here K is any invertible matrix.

* Real NVP, with Jacobian determinant as described in Real NVP.

The idea of using the invertible 1x1 convolution is to permute all channels in general, instead of merely permuting the first and second half, as in Real NVP.
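The log-determinant bookkeeping of the invertible 1x1 convolution can be illustrated directly. A minimal sketch (NumPy) with an arbitrary (assumed invertible) channel-mixing matrix K: every spatial position is multiplied by the same C x C matrix, so the layer's log-determinant is H·W·log|det K|.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 3, 8, 8
K = rng.normal(size=(C, C))                      # assumed invertible
y = rng.normal(size=(C, H, W))

z = np.einsum('cd,dhw->chw', K, y)               # z_{c,i,j} = sum_d K_{c,d} y_{d,i,j}
log_det = H * W * np.log(np.abs(np.linalg.det(K)))
y_rec = np.einsum('cd,dhw->chw', np.linalg.inv(K), z)
assert np.allclose(y_rec, y)                     # the layer is exactly invertible
</syntaxhighlight>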


Masked Autoregressive Flow (MAF)

An autoregressive model of a distribution on \R^n is defined as the following stochastic process:

: \begin{align} x_1 &\sim N(\mu_1, \sigma_1^2)\\ x_2 &\sim N(\mu_2(x_1), \sigma_2(x_1)^2)\\ &\cdots \\ x_n &\sim N(\mu_n(x_{1:n-1}), \sigma_n(x_{1:n-1})^2) \end{align}

where \mu_i: \R^{i-1} \to \R and \sigma_i: \R^{i-1} \to (0, \infty) are fixed functions that define the autoregressive model. By the reparameterization trick, the autoregressive model is generalized to a normalizing flow:

: \begin{align} x_1 &= \mu_1 + \sigma_1 z_1\\ x_2 &= \mu_2(x_1) + \sigma_2(x_1) z_2\\ &\cdots \\ x_n &= \mu_n(x_{1:n-1}) + \sigma_n(x_{1:n-1}) z_n \end{align}

The autoregressive model is recovered by setting z \sim N(0, I_n). The forward mapping is slow (because it is sequential), but the backward mapping is fast (because it is parallel). The Jacobian matrix is lower-triangular, so its determinant is \sigma_1 \sigma_2(x_1) \cdots \sigma_n(x_{1:n-1}). Reversing the two maps f_\theta and f_\theta^{-1} of MAF results in the Inverse Autoregressive Flow (IAF), which has fast forward mapping and slow backward mapping.
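The asymmetry between the two directions can be illustrated with toy autoregressive functions. A minimal sketch (NumPy); the \mu_i and \sigma_i below are arbitrary illustrative choices, not trained networks:

<syntaxhighlight lang="python">
import numpy as np

def mu(prefix):    return 0.5 * np.sum(prefix)       # mu_i(x_{<i})
def sigma(prefix): return np.exp(0.1 * np.sum(prefix))  # sigma_i(x_{<i}) > 0

def maf_sample(z):                  # z -> x: each x_i needs x_1..x_{i-1}: sequential
    x = np.empty_like(z)
    for i in range(len(z)):
        x[i] = mu(x[:i]) + sigma(x[:i]) * z[i]
    return x

def maf_log_det(x):                 # Jacobian is lower-triangular
    return np.sum([np.log(sigma(x[:i])) for i in range(len(x))])

z = np.random.default_rng(0).normal(size=5)
x = maf_sample(z)
# x -> z: every term depends only on x, so it can be computed in parallel
z_rec = np.array([(x[i] - mu(x[:i])) / sigma(x[:i]) for i in range(len(x))])
assert np.allclose(z_rec, z)
print(maf_log_det(x))               # = sum_i log sigma_i(x_{<i})
</syntaxhighlight>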


Continuous Normalizing Flow (CNF)

Instead of constructing a flow by function composition, another approach is to formulate the flow as a continuous-time dynamic. Let z_0 be the latent variable with distribution p(z_0). Map this latent variable to data space with the following flow function:

: x = F(z_0) = z_T = z_0 + \int_0^T f(z_t, t)\, dt

where f is an arbitrary function and can be modeled with e.g. neural networks. The inverse function is then naturally:

: z_0 = F^{-1}(x) = z_T + \int_T^0 f(z_t, t)\, dt = z_T - \int_0^T f(z_t, t)\, dt

And the log-likelihood of x can be found as:

: \log p(x) = \log p(z_0) - \int_0^T \operatorname{tr}\left[ \frac{\partial f}{\partial z_t} \right] dt

Since the trace depends only on the diagonal of the Jacobian \partial_{z_t} f, this allows a "free-form" Jacobian. Here, "free-form" means that there is no restriction on the Jacobian's form. It is contrasted with previous discrete models of normalizing flow, where the Jacobian is carefully designed to be only upper- or lower-triangular, so that it can be evaluated efficiently.

The trace can be estimated by "Hutchinson's trick": given any matrix W \in \R^{n \times n} and any random u \in \R^n with E[uu^T] = I, we have E[u^T W u] = \operatorname{tr}(W). (Proof: expand the expectation directly.) Usually, the random vector is sampled from N(0, I) (normal distribution) or \{\pm 1\}^n (Rademacher distribution).

When f is implemented as a neural network, neural ODE methods are needed; indeed, CNF was first proposed in the same paper that proposed the neural ODE.

There are two main deficiencies of CNF. One is that a continuous flow must be a homeomorphism, thus preserving orientation and ambient isotopy (for example, it is impossible to flip a left hand into a right hand by a continuous deformation of space, and it is impossible to turn a sphere inside out or undo a knot). The other is that the learned flow f might be ill-behaved, due to degeneracy (that is, there is an infinite number of possible f that all solve the same problem).

By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just as one can pick up a polygon from a desk and flip it over in 3-space, or unknot a knot in 4-space), yielding the "augmented neural ODE". Any homeomorphism of \R^n can be approximated by a neural ODE operating on \R^{2n+1}, as can be proved by combining the Whitney embedding theorem for manifolds and the universal approximation theorem for neural networks.

To regularize the flow f, one can impose regularization losses. The paper proposed the following regularization loss based on optimal transport theory:

: \lambda_K \int_0^T \left\| f(z_t, t) \right\|^2 dt + \lambda_J \int_0^T \left\| \nabla_z f(z_t, t) \right\|_F^2 dt

where \lambda_K, \lambda_J > 0 are hyperparameters. The first term punishes the model for oscillating the flow field over time, and the second term punishes it for oscillating the flow field over space. Both terms together guide the model into a flow that is smooth (not "bumpy") over space and time.
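Hutchinson's trace estimator itself is easy to demonstrate. A minimal sketch (NumPy) using Rademacher probe vectors on an arbitrary matrix:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, S = 50, 20000
W = rng.normal(size=(n, n))

# tr(W) ~ (1/S) sum_s u_s^T W u_s, with E[u u^T] = I (Rademacher probes here)
u = rng.choice([-1.0, 1.0], size=(S, n))
estimate = np.mean(np.einsum('si,ij,sj->s', u, W, u))
print(estimate, np.trace(W))         # the two values agree closely for large S
</syntaxhighlight>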


Flows on manifolds

When a probabilistic flow transforms a distribution on an m-dimensional smooth manifold embedded in \R^n, where m < n, and where the transformation is specified as a function \R^n \to \R^n, the scaling factor between the source and transformed PDFs is ''not'' given by the naive computation of the determinant of the n \times n Jacobian (which is zero), but instead by the determinant(s) of one or more suitably defined m \times m matrices. This section is an interpretation of the tutorial in the appendix of Sorrenson et al. (2023), where the more general case of non-isometrically embedded Riemannian manifolds is also treated. Here we restrict attention to isometrically embedded manifolds. As running examples of manifolds with a smooth, isometric embedding in \R^n we shall use:

* The unit hypersphere: \mathbb S^{n-1} = \{\mathbf x \in \R^n : \lVert\mathbf x\rVert = 1\}, where flows can be used to generalize e.g. von Mises–Fisher or uniform spherical distributions.
* The simplex interior: \Delta^{n-1} = \{\mathbf p \in \R^n : p_i > 0,\ \sum_{i=1}^n p_i = 1\}, where n-way categorical distributions live, and where flows can be used to generalize e.g. Dirichlet or uniform simplex distributions.

As a first example of a spherical manifold flow transform, consider the normalized linear transform, which radially projects onto the unit sphere the output of an invertible linear transform, parametrized by the n \times n invertible matrix \mathbf M:

: f_{\text{nl}}(\mathbf x; \mathbf M) = \frac{\mathbf{Mx}}{\lVert\mathbf{Mx}\rVert}

In full Euclidean space, f_{\text{nl}}: \R^n \to \R^n is ''not'' invertible, but if we restrict the domain and co-domain to the unit sphere, then f_{\text{nl}}: \mathbb S^{n-1} \to \mathbb S^{n-1} ''is'' invertible (more specifically it is a bijection, a homeomorphism and a diffeomorphism), with inverse f_{\text{nl}}(\cdot\,; \mathbf M^{-1}). The Jacobian of f_{\text{nl}}: \R^n \to \R^n at \mathbf y = f_{\text{nl}}(\mathbf x; \mathbf M) is \lVert\mathbf{Mx}\rVert^{-1}(\mathbf I_n - \mathbf{yy}')\mathbf M, which has rank n-1 and determinant zero; while, as explained in the subsections below, the factor relating source and transformed densities is \lVert\mathbf{Mx}\rVert^{-n}\left|\det\mathbf M\right|.


Differential volume ratio

For m < n, let \mathcal M \subset \R^n be an m-dimensional manifold with a smooth, isometric embedding into \R^n. Let f: \R^n \to \R^n be a smooth flow transform with range restricted to \mathcal M. Let \mathbf x \in \mathcal M be sampled from a distribution with density P_X. Let \mathbf y = f(\mathbf x), with resultant (pushforward) density P_Y. Let U \subset \mathcal M be a small, convex region containing \mathbf x and let V = f(U) be its image, which contains \mathbf y; then by conservation of probability mass:

: P_X(\mathbf x)\operatorname{vol}(U) \approx P_Y(\mathbf y)\operatorname{vol}(V)

where volume (for very small regions) is given by Lebesgue measure in the m-dimensional tangent space. By making the regions infinitesimally small, the factor relating the two densities is the ratio of volumes, which we term the differential volume ratio.

To obtain concrete formulas for volume on the m-dimensional manifold, we construct U by mapping an m-dimensional rectangle in (local) coordinate space to the manifold via a smooth embedding function \R^m \to \R^n. At very small scale, the embedding function becomes essentially linear, so that U is a parallelotope (a multidimensional generalization of a parallelogram). Similarly, the flow transform f becomes linear, so that the image V = f(U) is also a parallelotope. In \R^m, we can represent an m-dimensional parallelotope by an m \times m matrix whose column vectors are a set of edges (meeting at a common vertex) that span the parallelotope; the volume is given by the absolute value of the determinant of this matrix. If, more generally (as is the case here), an m-dimensional parallelotope is embedded in \R^n, it can be represented by a (tall) n \times m matrix, say \mathbf V. Denoting the parallelotope as /\mathbf V\!/, its volume is then given by the square root of the Gram determinant:

: \operatorname{vol}/\mathbf V\!/ = \sqrt{\det(\mathbf V'\mathbf V)}

In the sections below, we show various ways to use this volume formula to derive the differential volume ratio.
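The Gram-determinant volume formula, and the volume ratio built from it, can be evaluated numerically. A minimal sketch (NumPy) with arbitrary illustrative matrices, treating the columns of a tall matrix as the spanning edges of an embedded parallelotope:

<syntaxhighlight lang="python">
import numpy as np

def parallelotope_volume(V):
    """Volume of the parallelotope spanned by the columns of the tall matrix V."""
    return np.sqrt(np.linalg.det(V.T @ V))      # sqrt of the Gram determinant

rng = np.random.default_rng(0)
n, m = 5, 3
U_edges = rng.normal(size=(n, m))        # spans the source parallelotope U
J = rng.normal(size=(n, n))              # linearized flow transform at x
V_edges = J @ U_edges                    # spans the image parallelotope V = f(U)

ratio = parallelotope_volume(V_edges) / parallelotope_volume(U_edges)
print(ratio)                             # differential volume ratio of the linearized map
</syntaxhighlight>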


Simplex flow

As a first example, we develop expressions for the differential volume ratio of a simplex flow, \mathbf q = f(\mathbf p), where \mathbf p, \mathbf q \in \mathcal M = \Delta^{n-1}. Define the embedding function:

: e: \tilde{\mathbf p} = (p_1, \dots, p_{n-1}) \mapsto \mathbf p = \left(p_1, \dots, p_{n-1},\, 1 - \sum_{i=1}^{n-1} p_i\right)

which maps a conveniently chosen (n-1)-dimensional representation, \tilde{\mathbf p}, to the embedded manifold. The n \times (n-1) Jacobian is \mathbf E = \begin{bmatrix} \mathbf I_{n-1} \\ -\boldsymbol 1' \end{bmatrix}.

To define U, the differential volume element at the transformation input (\mathbf p \in \Delta^{n-1}), we start with a rectangle in \tilde{\mathbf p}-space, having (signed) differential side-lengths dp_1, \dots, dp_{n-1}, from which we form the square diagonal matrix \mathbf D, the columns of which span the rectangle. At very small scale, we get U = e(\mathbf D) = /\mathbf{ED}\!/, with:

: \operatorname{vol}(U) = \sqrt{\det(\mathbf D'\mathbf E'\mathbf E\mathbf D)} = \sqrt{\det(\mathbf E'\mathbf E)}\,\left|\det\mathbf D\right| = \sqrt{n}\prod_{i=1}^{n-1}\left|dp_i\right|

To understand the geometric interpretation of the factor \sqrt{n}, consider the 1-simplex: the embedding maps p_1 to (p_1, 1-p_1), so a coordinate step of length |dp_1| produces a segment of length \sqrt{2}\,|dp_1| on the embedded simplex.

The differential volume element at the transformation output (\mathbf q \in \Delta^{n-1}) is the parallelotope V = f(U) = /\mathbf F_{\mathbf p}\mathbf{ED}\!/, where \mathbf F_{\mathbf p} is the n \times n Jacobian of f at \mathbf p = e(\tilde{\mathbf p}). Its volume is:

: \operatorname{vol}(V) = \sqrt{\det(\mathbf D'\mathbf E'\mathbf F_{\mathbf p}'\mathbf F_{\mathbf p}\mathbf E\mathbf D)} = \sqrt{\det(\mathbf E'\mathbf F_{\mathbf p}'\mathbf F_{\mathbf p}\mathbf E)}\,\left|\det\mathbf D\right|

so that the factor \left|\det\mathbf D\right| cancels in the volume ratio, which can now already be numerically evaluated. It can however be rewritten in a sometimes more convenient form by also introducing the representation function, r: \mathbf p \mapsto \tilde{\mathbf p}, which simply extracts the first n-1 components; its Jacobian is \mathbf R = \begin{bmatrix}\mathbf I_{n-1} & \boldsymbol 0\end{bmatrix}. Observe that, since e \circ r \circ f = f, the chain rule for function composition gives \mathbf F_{\mathbf p} = \mathbf E\mathbf R\mathbf F_{\mathbf p}. By plugging this expansion into the above Gram determinant and then refactoring it as a product of determinants of square matrices, we can extract the factor \sqrt{\det(\mathbf E'\mathbf E)} = \sqrt{n}, which now also cancels in the ratio, which finally simplifies to the determinant of the Jacobian of the "sandwiched" flow transformation, r \circ f \circ e:

: R^\Delta_f(\mathbf p) = \frac{\operatorname{vol}(V)}{\operatorname{vol}(U)} = \left|\det(\mathbf R\mathbf F_{\mathbf p}\mathbf E)\right|

which, if \mathbf p \sim P_{\mathbf P}, can be used to derive the pushforward density after the change of variables \mathbf q = f(\mathbf p):

: P_{\mathbf Q}(\mathbf q) = \frac{P_{\mathbf P}(\mathbf p)}{R^\Delta_f(\mathbf p)}\,,\;\text{where}\;\mathbf p = f^{-1}(\mathbf q)

This formula is valid only because the simplex is flat and the Jacobian \mathbf E is constant. The more general case for curved manifolds is discussed below, after we present two concrete examples of simplex flow transforms.


Simplex calibration transform

A calibration transform, f_{\text{cal}}: \Delta^{n-1} \to \Delta^{n-1}, which is sometimes used in machine learning for post-processing of the (class posterior) outputs of a probabilistic n-class classifier, uses the softmax function to renormalize categorical distributions after scaling and translation of the input distributions in log-probability space. For \mathbf p, \mathbf q \in \Delta^{n-1} and with parameters a \ne 0 and \mathbf c \in \R^n, the transform can be specified as:

: \mathbf q = f_{\text{cal}}(\mathbf p; a, \mathbf c) = \operatorname{softmax}(a^{-1}\log\mathbf p + \mathbf c) \;\iff\; \mathbf p = f^{-1}_{\text{cal}}(\mathbf q; a, \mathbf c) = \operatorname{softmax}(a\log\mathbf q - a\mathbf c)

where the log is applied elementwise. After some algebra the differential volume ratio can be expressed as:

: R^\Delta_{\text{cal}}(\mathbf p; a, \mathbf c) = \left|\det(\mathbf R\mathbf F_{\mathbf p}\mathbf E)\right| = |a|^{1-n}\prod_{i=1}^n\frac{q_i}{p_i}

* This result can also be obtained by factoring the density of the SGB distribution, which is obtained by sending Dirichlet variates through f_{\text{cal}}.

While calibration transforms are most often trained as discriminative models, the reinterpretation here as a probabilistic flow also allows the design of generative calibration models based on this transform. When used for calibration, the restriction a > 0 can be imposed to prevent direction reversal in log-probability space. With the additional restriction \mathbf c = \boldsymbol 0, this transform (with discriminative training) is known in machine learning as temperature scaling.


Generalized calibration transform

The above calibration transform can be generalized to f_{\text{gcal}}: \Delta^{n-1} \to \Delta^{n-1}, with parameters \mathbf c \in \R^n and \mathbf A, n \times n invertible:

: \mathbf q = f_{\text{gcal}}(\mathbf p; \mathbf A, \mathbf c) = \operatorname{softmax}(\mathbf A\log\mathbf p + \mathbf c)\,,\;\text{subject to}\;\mathbf A\mathbf 1 = \lambda\mathbf 1

where the condition that \mathbf A has \mathbf 1 as an eigenvector ensures invertibility by sidestepping the information loss due to the invariance \operatorname{softmax}(\mathbf x + \alpha\mathbf 1) = \operatorname{softmax}(\mathbf x). Note in particular that \mathbf A = \lambda\mathbf I_n is the ''only'' allowed diagonal parametrization, in which case we recover f_{\text{cal}}(\mathbf p; \lambda^{-1}, \mathbf c), while (for n > 2) generalization ''is'' possible with non-diagonal matrices. The inverse is:

: \mathbf p = f_{\text{gcal}}^{-1}(\mathbf q; \mathbf A, \mathbf c) = f_{\text{gcal}}(\mathbf q; \mathbf A^{-1}, -\mathbf A^{-1}\mathbf c)\,,\;\text{where}\;\mathbf A\mathbf 1 = \lambda\mathbf 1 \Longrightarrow \mathbf A^{-1}\mathbf 1 = \lambda^{-1}\mathbf 1

The differential volume ratio is:

: R^\Delta_{\text{gcal}}(\mathbf p; \mathbf A, \mathbf c) = \frac{\left|\det\mathbf A\right|}{|\lambda|}\prod_{i=1}^n\frac{q_i}{p_i}

If f_{\text{gcal}} is to be used as a calibration transform, further constraints could be imposed, for example that \mathbf A be positive definite, so that (\mathbf A\mathbf x)'\mathbf x > 0, which avoids direction reversals. (This is one possible generalization of a > 0 in the f_{\text{cal}} parameter.) For n = 2, with a > 0 and \mathbf A positive definite, f_{\text{cal}} and f_{\text{gcal}} are equivalent in the sense that in both cases \log\frac{p_1}{p_2} \mapsto \log\frac{q_1}{q_2} is a straight line, the (positive) slope and offset of which are functions of the transform parameters. For n > 2, f_{\text{gcal}} ''does'' generalize f_{\text{cal}}. It must however be noted that chaining multiple f_{\text{gcal}} flow transformations does ''not'' give a further generalization, because:

: f_{\text{gcal}}(\cdot\,; \mathbf A_1, \mathbf c_1) \circ f_{\text{gcal}}(\cdot\,; \mathbf A_2, \mathbf c_2) = f_{\text{gcal}}(\cdot\,; \mathbf A_1\mathbf A_2, \mathbf c_1 + \mathbf A_1\mathbf c_2)

In fact, the set of f_{\text{gcal}} transformations forms a group under function composition, and the set of f_{\text{cal}} transformations forms a subgroup. Also see: Dirichlet calibration, which generalizes f_{\text{gcal}} by not placing any restriction on the matrix \mathbf A, so that invertibility is not guaranteed. While Dirichlet calibration is trained as a discriminative model, f_{\text{gcal}} can also be trained as part of a generative calibration model.


Differential volume ratio for curved manifolds

Consider a flow, \mathbf y = f(\mathbf x), on a curved manifold, for example \mathbb S^{n-1}, which we equip with the embedding function e that maps a set of n-1 angular spherical coordinates to \mathbb S^{n-1}. The Jacobian of e is non-constant and we have to evaluate it at both the input (\mathbf E_{\mathbf x}) and the output (\mathbf E_{\mathbf y}). The same applies to r, the representation function that recovers spherical coordinates from points on \mathbb S^{n-1}, for which we need the Jacobian at the output (\mathbf R_{\mathbf y}). The differential volume ratio now generalizes to:

: R_f(\mathbf x) = \left|\det(\mathbf R_{\mathbf y}\mathbf F_{\mathbf x}\mathbf E_{\mathbf x})\right|\,\frac{\sqrt{\det(\mathbf E_{\mathbf y}'\mathbf E_{\mathbf y})}}{\sqrt{\det(\mathbf E_{\mathbf x}'\mathbf E_{\mathbf x})}}

For geometric insight, consider \mathbb S^2, where the spherical coordinates are co-latitude \theta \in [0,\pi] and longitude \phi \in [0, 2\pi); there \sqrt{\det(\mathbf E'\mathbf E)} = \sin\theta, the familiar factor in the differential area element \sin\theta\, d\theta\, d\phi.

Rather than working with such global coordinates, it is more convenient to use, at each point, local coordinates in the tangent space. For this we need \pi, a smooth projection onto the manifold whose Jacobian at each point of the manifold is an idempotent linear projection onto the local tangent space (''orthogonal'' for the unitsphere: \mathbf I_n - \mathbf x\mathbf x'; ''oblique'' for the simplex: \mathbf I_n - \mathbf p\boldsymbol 1'); the columns of this projection span the m-dimensional tangent space at \mathbf x. We use the notation \mathbf T_{\mathbf x} for any n \times m matrix with orthonormal columns (\mathbf T_{\mathbf x}'\mathbf T_{\mathbf x} = \mathbf I_m) that span the local tangent space; note that the projection maps \mathbf T_{\mathbf x} to itself. We can now choose our local coordinate embedding function, e_{\mathbf x}: \R^m \to \R^n:

: e_{\mathbf x}(\tilde{\mathbf x}) = \pi(\mathbf x + \mathbf T_{\mathbf x}\tilde{\mathbf x})\,,\;\text{with Jacobian}\;\mathbf E_{\mathbf x} = \mathbf T_{\mathbf x}\;\text{at}\;\tilde{\mathbf x} = \mathbf 0

Since the Jacobian is injective (full rank: m), a local (not necessarily unique) left inverse, say r^*_{\mathbf x} with Jacobian \mathbf R^*_{\mathbf x}, exists such that r^*_{\mathbf x}(e_{\mathbf x}(\tilde{\mathbf x})) = \tilde{\mathbf x} and \mathbf R^*_{\mathbf x}\mathbf E_{\mathbf x} = \mathbf I_m. In practice we do not need the left inverse function itself, but we ''do'' need its Jacobian, for which the above equation does not give a unique solution. We can however enforce a unique solution for the Jacobian by choosing the left inverse as r_{\mathbf x}: \R^n \to \R^m:

: r_{\mathbf x}(\mathbf z) = r^*_{\mathbf x}(\pi(\mathbf z))\,,\;\text{with Jacobian}\;\mathbf R_{\mathbf x} = \mathbf T_{\mathbf x}'

We can now finally plug \mathbf E_{\mathbf x} = \mathbf T_{\mathbf x} and \mathbf R_{\mathbf y} = \mathbf T_{\mathbf y}' into our previous expression for R_f, the differential volume ratio, which, because of the orthonormal Jacobians, simplifies to:

: R_f(\mathbf x) = \left|\det(\mathbf T_{\mathbf y}'\mathbf F_{\mathbf x}\mathbf T_{\mathbf x})\right|


Practical implementation

For learning the parameters of a manifold flow transformation, we need access to the differential volume ratio, R_f, or at least to its gradient w.r.t. the parameters. Moreover, for some inference tasks, we need access to R_f itself. Practical solutions include:

* Sorrenson et al. (2023) give a solution for computationally efficient stochastic approximation of the parameter gradient of \log R_f.
* For some hand-designed flow transforms, R_f can be analytically derived in closed form, for example the above-mentioned simplex calibration transforms. Further examples are given below in the section on simple spherical flows.
* On a software platform equipped with linear algebra and automatic differentiation, R_f(\mathbf x) = \left|\det(\mathbf T_{\mathbf y}'\mathbf F_{\mathbf x}\mathbf T_{\mathbf x})\right| can be automatically evaluated, given access to only \mathbf x, f and \pi. But this is expensive for high-dimensional data, with at least \mathcal O(n^3) computational cost. Even then, the slow automatic solution can be invaluable as a tool for numerically verifying hand-designed closed-form solutions; see the sketch after this list.
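As an illustration of the third (automatic but slow) route, the following sketch (NumPy) evaluates R_f(\mathbf x) = \left|\det(\mathbf T_{\mathbf y}'\mathbf F_{\mathbf x}\mathbf T_{\mathbf x})\right| for the normalized linear flow on the sphere, using a finite-difference Jacobian in place of automatic differentiation, and compares it with the closed form given below in the normalized linear flow section:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.normal(size=(n, n))                       # assumed invertible
x = rng.normal(size=n); x /= np.linalg.norm(x)    # a point on S^{n-1}

f = lambda v: (M @ v) / np.linalg.norm(M @ v)     # normalized linear flow
y = f(x)

def tangent_basis(p):
    # columns of the orthogonal projection I - pp' span the tangent space at p;
    # an SVD extracts n-1 orthonormal spanning vectors
    U, s, _ = np.linalg.svd(np.eye(len(p)) - np.outer(p, p))
    return U[:, : len(p) - 1]

# Jacobian F_x of f at x by central finite differences (autodiff would also do)
eps = 1e-6
F = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])

T_x, T_y = tangent_basis(x), tangent_basis(y)
R_num = abs(np.linalg.det(T_y.T @ F @ T_x))
R_closed = abs(np.linalg.det(M)) / np.linalg.norm(M @ x) ** n
assert np.isclose(R_num, R_closed, rtol=1e-4)
</syntaxhighlight>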


Simple spherical flows

In the machine learning literature, various complex spherical flows formed by deep neural network architectures may be found. In contrast, this section compiles from the ''statistics'' literature the details of three very simple spherical flow transforms, with simple closed-form expressions for inverses and differential volume ratios. These flows can be used individually, or chained, to generalize distributions on the unit sphere, \mathbb S^{n-1}. All three flows are compositions of an invertible affine transform in \R^n, followed by radial projection back onto the sphere. The flavours we consider for the affine transform are: pure translation, pure linear and general affine. To make these flows fully functional for learning, inference and sampling, the tasks are:

* To derive the inverse transform, with suitable restrictions on the parameters to ensure invertibility.
* To derive in simple closed form the differential volume ratio, R_f.

An interesting property of these simple spherical flows is that they do not make use of any non-linearities apart from the radial projection. Even the simplest of them, the normalized translation flow, can be chained to form perhaps surprisingly flexible distributions.


Normalized translation flow

The normalized translation flow, f_{\text{nt}}: \mathbb S^{n-1} \to \mathbb S^{n-1}, with parameter \mathbf c \in \R^n, is given by:

: \mathbf y = f_{\text{nt}}(\mathbf x; \mathbf c) = \frac{\mathbf x + \mathbf c}{\lVert\mathbf x + \mathbf c\rVert}\,,\;\text{where}\;\lVert\mathbf c\rVert < 1

The inverse function may be derived by considering, for \ell > 0, \mathbf y = \ell^{-1}(\mathbf x + \mathbf c) and then using \mathbf x'\mathbf x = 1 to get a quadratic equation for \ell, which gives:

: \mathbf x = f^{-1}_{\text{nt}}(\mathbf y; \mathbf c) = \ell\mathbf y - \mathbf c\,,\;\text{where}\;\ell = \mathbf y'\mathbf c + \sqrt{(\mathbf y'\mathbf c)^2 + 1 - \mathbf c'\mathbf c}

from which we see that we need \lVert\mathbf c\rVert < 1 to keep \ell real and positive for all \mathbf y \in \mathbb S^{n-1}. The differential volume ratio is given (without derivation) by Boulerice & Ducharme (1994) as:

: R_{\text{nt}}(\mathbf x; \mathbf c) = \frac{1 + \mathbf x'\mathbf c}{\lVert\mathbf x + \mathbf c\rVert^n}

This can indeed be verified analytically:

* By a laborious manipulation of R_f(\mathbf x) = \left|\det(\mathbf T_{\mathbf y}'\mathbf F_{\mathbf x}\mathbf T_{\mathbf x})\right|.
* By setting \mathbf M = \mathbf I_n in R_{\text{na}}(\mathbf x; \mathbf M, \mathbf c), which is given below.

Finally, it is worth noting that f_{\text{nt}} and f^{-1}_{\text{nt}} do not have the same functional form.
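A minimal sketch (NumPy) of the normalized translation flow and its closed-form inverse, following the formulas above; the point and parameter are arbitrary, with \mathbf c rescaled to satisfy \lVert\mathbf c\rVert < 1:

<syntaxhighlight lang="python">
import numpy as np

def f_nt(x, c):                                  # requires ||c|| < 1
    y = x + c
    return y / np.linalg.norm(y)

def f_nt_inv(y, c):
    # ell solves ||ell*y - c|| = 1 with ell > 0 (quadratic in ell)
    ell = y @ c + np.sqrt((y @ c) ** 2 + 1.0 - c @ c)
    return ell * y - c

rng = np.random.default_rng(0)
x = rng.normal(size=4); x /= np.linalg.norm(x)   # a point on S^3
c = 0.4 * rng.normal(size=4)
c /= max(1.0, np.linalg.norm(c) / 0.9)           # enforce ||c|| <= 0.9 < 1
assert np.allclose(f_nt_inv(f_nt(x, c), c), x)
</syntaxhighlight>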


Normalized linear flow

The normalized linear flow, f_{\text{nl}}: \mathbb S^{n-1} \to \mathbb S^{n-1}, where the parameter \mathbf M is an invertible n \times n matrix, is given by:

: \mathbf y = f_{\text{nl}}(\mathbf x; \mathbf M) = \frac{\mathbf{Mx}}{\lVert\mathbf{Mx}\rVert} \;\iff\; \mathbf x = f^{-1}_{\text{nl}}(\mathbf y; \mathbf M) = f_{\text{nl}}(\mathbf y; \mathbf M^{-1}) = \frac{\mathbf M^{-1}\mathbf y}{\lVert\mathbf M^{-1}\mathbf y\rVert}

The differential volume ratio is:

: R_{\text{nl}}(\mathbf x; \mathbf M) = \frac{\left|\det\mathbf M\right|}{\lVert\mathbf{Mx}\rVert^n}

This result can be derived indirectly via the angular central Gaussian (ACG) distribution, which can be obtained via a normalized linear transform of either Gaussian or uniform spherical variates. The first relationship can be used to derive the ACG density by a marginalization integral over the radius, after which the second relationship can be used to factor out the differential volume ratio. For details, see the ACG distribution.


Normalized affine flow

The normalized affine flow, f_{\text{na}}: \mathbb S^{n-1} \to \mathbb S^{n-1}, with parameters \mathbf c \in \R^n and \mathbf M, n \times n invertible, is given by:

: f_{\text{na}}(\mathbf x; \mathbf M, \mathbf c) = \frac{\mathbf{Mx} + \mathbf c}{\lVert\mathbf{Mx} + \mathbf c\rVert}\,,\;\text{where}\;\lVert\mathbf M^{-1}\mathbf c\rVert < 1

The inverse function, derived in a similar way to the normalized translation inverse, is:

: \mathbf x = f^{-1}_{\text{na}}(\mathbf y; \mathbf M, \mathbf c) = \mathbf M^{-1}(\ell\mathbf y - \mathbf c)\,,\;\text{where}\;\ell = \frac{\mathbf y'\mathbf W\mathbf c + \sqrt{(\mathbf y'\mathbf W\mathbf c)^2 + (\mathbf y'\mathbf W\mathbf y)(1 - \mathbf c'\mathbf W\mathbf c)}}{\mathbf y'\mathbf W\mathbf y}

where \mathbf W = (\mathbf{MM}')^{-1}. The differential volume ratio is:

: R_{\text{na}}(\mathbf x; \mathbf M, \mathbf c) = R_{\text{nl}}(\mathbf x; \mathbf M + \mathbf c\mathbf x') = \frac{\left|\det\mathbf M\right|\,(1 + \mathbf x'\mathbf M^{-1}\mathbf c)}{\lVert\mathbf{Mx} + \mathbf c\rVert^n}

The final numerator on the right-hand side was expanded from \det(\mathbf M + \mathbf c\mathbf x') by the matrix determinant lemma. Recalling R_f(\mathbf x) = \left|\det(\mathbf T_{\mathbf y}'\mathbf F_{\mathbf x}\mathbf T_{\mathbf x})\right|, the equality between R_{\text{na}} and R_{\text{nl}} holds because not only:

: \mathbf x'\mathbf x = 1 \;\Longrightarrow\; \mathbf y = f_{\text{na}}(\mathbf x; \mathbf M, \mathbf c) = f_{\text{nl}}(\mathbf x; \mathbf M + \mathbf c\mathbf x')

but also, by orthogonality of \mathbf x to the local tangent space:

: \mathbf x'\mathbf T_{\mathbf x} = \boldsymbol 0 \;\Longrightarrow\; \mathbf F_{\mathbf x}^{\text{na}}\mathbf T_{\mathbf x} = \mathbf F_{\mathbf x}^{\text{nl}}\mathbf T_{\mathbf x}

where \mathbf F_{\mathbf x}^{\text{na}} = \lVert\mathbf{Mx} + \mathbf c\rVert^{-1}(\mathbf I_n - \mathbf{yy}')\mathbf M is the Jacobian of f_{\text{na}} differentiated w.r.t. its input, but ''not'' also w.r.t. its parameter.


Downsides

Despite the success of normalizing flows in estimating high-dimensional densities, some downsides still exist in their design. First of all, their latent space, onto which input data is projected, is not a lower-dimensional space; therefore, flow-based models do not allow for compression of data by default and require a lot of computation. However, it is still possible to perform image compression with them.

Flow-based models are also notorious for failing to estimate the likelihood of out-of-distribution samples (i.e., samples that were not drawn from the same distribution as the training set). Some hypotheses were formulated to explain this phenomenon, among which the typical set hypothesis, estimation issues when training models, or fundamental issues due to the entropy of the data distributions.

One of the most interesting properties of normalizing flows is the invertibility of their learned bijective map. This property is given by constraints in the design of the models (cf. RealNVP, Glow) which guarantee theoretical invertibility. The integrity of the inverse is important in order to ensure the applicability of the change-of-variable theorem, the computation of the Jacobian of the map, as well as sampling with the model. However, in practice this invertibility can be violated and the inverse map can explode because of numerical imprecision.


Applications

Flow-based generative models have been applied on a variety of modeling tasks, including:

* Audio generation
* Image generation
* Molecular graph generation
* Point-cloud modeling
* Video generation
* Lossy image compression
* Anomaly detection


References

{{reflist


External links


Flow-based Deep Generative Models

Normalizing flow models
Machine learning Statistical models Probabilistic models